Abstract

Background: Non-negative matrix factorization (NMF) has been shown to be a powerful tool for clustering gene
expression data, which are widely used to classify cancers. NMF aims to find two non-negative matrices whose
product closely approximates the original matrix. Traditional NMF methods minimize either the $\ell_2$ norm or the
Kullback-Leibler distance between the product of the two matrices and the original matrix. Correntropy was recently
shown to be an effective similarity measurement due to its stability to outliers or noise.

Results: We propose a maximum correntropy criterion (MCC)-based NMF method (NMF-MCC) for gene expression 
data-based cancer clustering. Instead of minimizing the $\ell_2$ norm or the Kullback-Leibler distance, NMF-MCC maximizes
the correntropy between the product of the two matrices and the original matrix. The optimization problem can be 
solved by an expectation conditional maximization algorithm. 

Conclusions: Extensive experiments on six cancer benchmark sets demonstrate that the proposed method is 
significantly more accurate than the state-of-the-art methods in cancer clustering. 



Background 

Because cancer has been a leading cause of death in the 
world for several decades, the classification of cancers is 
becoming more and more important to cancer treatment 
and prognosis [1,2]. With advances in DNA microarray
technology, it is now possible to monitor the expression 
levels of a large number of genes at the same time. There 
have been a variety of studies on analyzing DNA microar- 
ray data for cancer class discovery [3-5]. Such methods are 
demonstrated to outperform the traditional, morpholog- 
ical appearance-based cancer classification methods. In 
such studies, different cancer classes are discriminated by 
their corresponding gene expression profiles [1]. 

Several clustering algorithms have been used to identify
groups of similarly expressed genes. Non-negative matrix
factorization (NMF) was recently introduced to analyze
gene expression data, and this method demonstrated superior
performance in terms of both accuracy and stability
[6-8]. Gao and Church [3] reported an effective unsupervised
method for cancer clustering with gene expression
profiles via sparse NMF (SNMF). Carmona et al. [9] presented
a methodology based on non-smooth non-negative matrix
factorization (nsNMF) that was able to cluster closely
related genes and conditions in sub-portions of the data
and to identify localized patterns in large datasets.
Zheng et al. [5,7] applied penalized matrix
decomposition (PMD) to extract meta-samples from gene
expression data, which could capture the inherent structures
of samples that belonged to the same class.

NMF approximates a given gene data matrix, X, as a
product of two low-rank nonnegative matrices, H and W,
as $X \approx HW$. This is usually formulated as an optimization
problem, where the objective function is to minimize
either the $\ell_2$ norm or the Kullback-Leibler (KL) distance
[10] between X and HW. Most of the improved NMF algorithms
are also based on the minimization of these two
distances while adding a sparseness term [3], a graph
regularization term [11], etc. Sandler and Lindenbaum
[12] argued that measuring the dissimilarity of X and HW
by either the $\ell_2$ norm or the KL distance, even with additional
bias terms, was inappropriate in computer vision
applications due to the nature of errors in images. They
proposed a novel NMF with the earth mover's distance (EMD)
metric by minimizing the EMD error between X and HW.
The proposed NMF-EMD algorithm demonstrated significantly
improved performance in two challenging computer vision
tasks, i.e., texture classification and face recognition.
Liu et al. [4] tested a family of NMF algorithms using the
α-divergence with different α values as dissimilarities
between X and HW for clustering cancer gene expression data.

It is widely acknowledged that DNA microarray data contain
many types of noise, especially experimental noise.
Recently, correntropy was shown to be an effective similarity
measurement in information theory due to its
stability to outliers or noise [13]. However, it has not
been used in the analysis of microarray data. In this
paper, we propose a novel form of NMF that maximizes
the correntropy. We introduce a new NMF algorithm
with a maximum correntropy criterion (MCC) [13] for
the gene expression data-based cancer clustering problem.
We call it NMF-MCC. The goal of NMF-MCC is
to find a meta-sample matrix, H, and a coding matrix,
W, such that the gene expression data matrix, X, is
as correlative to the product of H and W as possible
under MCC.

Related works 

He et al. [13] recently developed a face recognition algo- 
rithm, correntropy-based sparse representation (CESR), 
based on MCC. CESR tries to find a group of sparse 
combination coefficients to maximize the correntropy 
between the facial image vector and the linear combina- 
tion of faces in the database. He et al. [13] demonstrated 
that CESR was much more effective in dealing with the 
occlusion and corruption problems of face recognition 
than the state-of-the-art methods. However, CESR learns 
only the combination coefficients while the basis faces 
(the faces in the database) are fixed. Compared with CESR,
NMF-MCC can learn both the combination coefficients 
and the basis vectors jointly, which allows the algorithm 
to obtain more basis vectors for better representation of 
the data points. Zafeiriou and Petrou [14] addressed the 
problem of NMF with kernel functions instead of inner 
products and proposed the projected gradient kernel 
nonnegative matrix factorization (PGK-NMF) algorithm. 
Both NMF-MCC and PGK-NMF employ kernel functions 
to map the linear data space to a non-linear space. How- 
ever, as we show later, NMF-MCC computes different 
kernels for different features, while PGK-NMF computes 
a single kernel for the whole feature vector. Thus, NMF- 
MCC allows the algorithm to assign different weights to 
different features and emphasizes the discriminant fea- 
tures with high weights, thus achieving feature selection. 
In contrast, like most kernel-based methods, PGK-NMF
simply replaces the inner product with the kernel function
and treats the features equally, so it provides no feature
selection function.

Methods 

In this section, we first briefly introduce the traditional 
NMF method. We then propose our novel NMF-MCC 
algorithm by maximizing the correntropy in NMF. We 
further propose an expectation conditional maximization-based
approach to solve the optimization problem.

Nonnegative matrix factorization 

NMF is a matrix factorization algorithm that focuses on 
the analysis of data matrices whose elements are nonneg- 
ative. Consider a gene expression dataset that consists of 
D genes in N samples. We denote it by a matrix
$X = [x_1, \cdots, x_N] \in \mathbb{R}_+^{D \times N}$ of size D × N, where each column of
X is a sample vector containing D genes. NMF aims to
find two non-negative matrices, $H = [h_{dk}] \in \mathbb{R}_+^{D \times K}$ and
$W = [w_{kn}] \in \mathbb{R}_+^{K \times N}$, whose product closely approximates
the original matrix X:

\[ X \approx HW. \tag{1} \]

Matrix H is of size D × K, with each of the K columns
defining a meta-sample and each entry, $h_{dk}$, in H representing
the expression level of gene d over meta-sample
k. Matrix W is of size K × N, with each of the N columns
representing the meta-sample expression pattern of the
corresponding sample, and each entry, $w_{kn}$, representing
the coefficient of meta-sample k over sample n. Figure 1
shows an example of the factorization of a gene expression
matrix X with D = 2308 genes and N = 83 samples
as the product of the meta-sample matrix H with K = 4
meta-samples and the coding matrix W.

The factorization is quantified by an objective function 
that minimizes some distance measure, such as: 

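• $\ell_2$ norm distance: One simple measure is the square
of the $\ell_2$ norm distance (also known as the Frobenius
norm or the Euclidean distance) between two
matrices, which is defined as:

\[ F^{\ell_2} = \sum_{d=1}^{D} \sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk} w_{kn} \right)^2. \tag{2} \]

• Kullback-Leibler (KL) divergence: The second
one is the divergence between two matrices [10],
which is defined as:

\[ F^{KL} = \sum_{d=1}^{D} \sum_{n=1}^{N} \left( x_{dn} \ln \frac{x_{dn}}{\sum_{k=1}^{K} h_{dk} w_{kn}} - x_{dn} + \sum_{k=1}^{K} h_{dk} w_{kn} \right). \tag{3} \]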
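As a minimal sketch of the two distance measures in (2) and (3), the following NumPy code evaluates both on a toy matrix. The function names, the toy sizes and the small constant `eps` (added only to avoid taking the logarithm of zero) are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def l2_loss(X, H, W):
    """Squared l2 (Frobenius) distance between X and HW, following Eq. (2)."""
    R = X - H @ W
    return float(np.sum(R ** 2))

def kl_loss(X, H, W, eps=1e-12):
    """KL divergence between X and HW, following Eq. (3).

    eps is an illustrative safeguard against log(0) and division by zero.
    """
    Y = H @ W
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))

rng = np.random.default_rng(0)
D, N, K = 6, 5, 2                      # toy numbers of genes, samples, meta-samples
H_true = rng.random((D, K))
W_true = rng.random((K, N))
X = H_true @ W_true                    # exactly factorizable non-negative matrix

H_rand = rng.random((D, K))            # an arbitrary, poorly fitting factorization
W_rand = rng.random((K, N))

print("exact factors :", l2_loss(X, H_true, W_true), kl_loss(X, H_true, W_true))
print("random factors:", l2_loss(X, H_rand, W_rand), kl_loss(X, H_rand, W_rand))
```

Both measures vanish for an exact factorization and grow as HW departs from X, which is why either can serve as the objective of traditional NMF.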
Maximum correntropy criterion for NMF


Correntropy is a nonlinear similarity measure between
two random variables, x and y [13,15,16], defined as

\[ V_\sigma(x, y) = E[k_\sigma(x - y)], \tag{4} \]

where $k_\sigma$ is a kernel that satisfies the Mercer theory and
E[·] is the expectation. One of the common choices of
$k_\sigma$ is the Gaussian kernel, given as
$k_\sigma(x - y) = \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right)$.



In practice, the joint probability density function of
x and y is unknown and only a finite amount of data
$\{(x_i, y_i)\}$, $i = 1, \cdots, I$, is available. Therefore, the sample
correntropy is estimated by

\[ \hat{V}_\sigma(x, y) = \frac{1}{I} \sum_{i=1}^{I} k_\sigma(x_i - y_i). \tag{5} \]
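The sample estimate in (5) can be sketched in a few lines. The code below assumes the Gaussian kernel and a kernel size of σ = 1; the function names and the toy data are hypothetical and chosen only to illustrate why correntropy is described as stable to outliers.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel k_sigma(e) = exp(-e^2 / (2 sigma^2)), applied element-wise."""
    e = np.asarray(e, dtype=float)
    return np.exp(-e ** 2 / (2.0 * sigma ** 2))

def sample_correntropy(x, y, sigma=1.0):
    """Sample correntropy of Eq. (5): the mean kernel value over paired observations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(x - y, sigma)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = x.copy()
y_outlier = np.array([1.0, 2.0, 3.0, 104.0])   # one grossly corrupted observation

v_clean = sample_correntropy(x, y_clean)
v_outlier = sample_correntropy(x, y_outlier)
mse_outlier = float(np.mean((x - y_outlier) ** 2))

print(v_clean, v_outlier, mse_outlier)
```

A single corrupted pair can lower the sample correntropy by at most 1/I, however large the error, whereas the mean squared error grows with the square of that error.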

Based on Eq. (5), a general similarity measurement
between any two discrete gene expression vectors was
proposed [17]. They introduced the correntropy induced
metric (CIM) for any two gene sample vectors
$x = [x_1, \cdots, x_D]^\top$ and $y = [y_1, \cdots, y_D]^\top$ as:

\[ CIM(x, y) = \left( k_\sigma(0) - \frac{1}{D} \sum_{d=1}^{D} k_\sigma(x_d - y_d) \right)^{1/2}, \tag{6} \]

where $e_d = x_d - y_d$ is defined as the error. For adaptive systems,
we can define the maximum correntropy criterion
(MCC) [18] as

\[ \max_{\theta} \sum_{d=1}^{D} k_\sigma(x_d - y_d), \qquad k_\sigma(x_d - y_d) = \exp\left( -\frac{(x_d - y_d)^2}{2\sigma^2} \right), \tag{7} \]

where $\theta$ is a parameter to be specified later. We must
note the difference between MCC and the common kernel
criterion used in [14]. The Gaussian kernel function of
vectors x and y is defined as

\[ k_\sigma(x - y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) = \exp\left( -\frac{\sum_{d=1}^{D} (x_d - y_d)^2}{2\sigma^2} \right). \tag{8} \]



We can see that the kernel in (8) is applied to the entire feature
vector, x, and each feature $x_d$, $d = 1, \cdots, D$, is treated
equally with the same kernel parameter. However, in (7),
kernel functions are applied to the different features individually. This
allows the algorithm to learn different kernel parameters,
as we will introduce later. In this way, we can assign
different weights to different features and thus implement
feature selection.
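The contrast between the per-feature criterion in (7) and the whole-vector kernel in (8) can be illustrated with a small sketch. Both functions below use a Gaussian kernel with the same kernel size σ = 1 for every feature, which is a simplifying assumption; the names and the toy vectors are hypothetical.

```python
import numpy as np

def mcc_similarity(x, y, sigma=1.0):
    """Sum of per-feature Gaussian kernels, the quantity maximized in Eq. (7)."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(np.exp(-e ** 2 / (2.0 * sigma ** 2))))

def vector_kernel(x, y, sigma=1.0):
    """Single Gaussian kernel applied to the whole vector, as in Eq. (8)."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.sum(e ** 2) / (2.0 * sigma ** 2)))

x = np.zeros(5)
y = np.array([0.0, 0.0, 0.0, 0.0, 10.0])   # four features agree, one is corrupted

print("per-feature (7):", mcc_similarity(x, y), "out of a maximum of", len(x))
print("whole-vector (8):", vector_kernel(x, y), "out of a maximum of 1")
```

With one corrupted feature, the whole-vector kernel collapses towards zero, while the per-feature sum loses only the contribution of that feature. The four clean features keep their full contribution, which is the behaviour that the feature weighting exploits.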

Our goal is to find a meta-sample matrix, H, and a
coding matrix, W, such that HW is as correlative to
X as possible under MCC as described in Eq. (7). To
extend MCC from vector space to the matrix space $\mathbb{R}^{D \times N}$,
we replace $e_d = (x_d - y_d)$ with the $\ell_2$ norm distance
between the d-th rows of X and $Y = HW$, taken over all samples, as
$e_d = \sqrt{\sum_{n=1}^{N} (x_{dn} - y_{dn})^2}$, where $y_{dn}$ is the (d, n)-th item of Y,
and $y_{dn} = \sum_{k=1}^{K} h_{dk} w_{kn}$. Moreover, the factorization system
parameter should be set to $\theta = (H, W)$ under the
framework of NMF-MCC. By substituting the newly defined
$e_d$ and $\theta$ into (7), we can formulate the problem of NMF-MCC
as the following optimization problem:

\[ \max_{H, W} F(H, W) \quad \text{s.t. } H \ge 0,\ W \ge 0, \]

\[ F(H, W) = \sum_{d=1}^{D} k_\sigma(e_d) = \sum_{d=1}^{D} k_\sigma\left( \sqrt{\sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk} w_{kn} \right)^2} \right) = \sum_{d=1}^{D} \exp\left( -\frac{\sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk} w_{kn} \right)^2}{2\sigma^2} \right). \tag{9} \]
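A minimal sketch of the objective in (9) follows. It assumes a fixed kernel size σ = 1 and a toy matrix in which one gene is heavily corrupted; the function name and the data are illustrative only.

```python
import numpy as np

def nmf_mcc_objective(X, H, W, sigma):
    """F(H, W) of Eq. (9): one Gaussian kernel per gene (row of X)."""
    R = X - H @ W
    e2 = np.sum(R ** 2, axis=1)            # squared l2 error e_d^2 of each gene
    return float(np.sum(np.exp(-e2 / (2.0 * sigma ** 2))))

rng = np.random.default_rng(0)
D, N, K = 6, 5, 2
H = rng.random((D, K))
W = rng.random((K, N))
X = H @ W                                  # perfectly reconstructed data

X_noisy = X.copy()
X_noisy[0] += 50.0                         # corrupt every sample of gene 0

f_clean = nmf_mcc_objective(X, H, W, sigma=1.0)
f_noisy = nmf_mcc_objective(X_noisy, H, W, sigma=1.0)
sq_noisy = float(np.sum((X_noisy - H @ W) ** 2))

print(f_clean, f_noisy, sq_noisy)
```

Each gene contributes a value between 0 and 1, so the corrupted gene can reduce F by at most 1, whereas the same corruption dominates the squared $\ell_2$ error.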



We should note the significant difference between NMF-MCC
and CESR. As a supervised learning algorithm,
CESR represents a test data point, $x_t$, as a linear
combination of all the training data points as
$x_t \approx \sum_{n=1}^{N} x_n w_{nt} = X w_t$, where $w_t = [w_{1t}, \cdots, w_{Nt}]^\top$ is the
combination coefficient vector. CESR aims to find the optimal
$w_t$ to maximize the correntropy between $x_t$ and $X w_t$. Similarly,
NMF-MCC also tries to represent a data point $x_n$
as a linear combination of some basis vectors as
$x_n \approx \sum_{k=1}^{K} h_k w_{kn} = H w_n$, where $w_n = [w_{1n}, \cdots, w_{Kn}]^\top$ is the
combination coefficient vector. Unlike CESR,
NMF-MCC aims to find not only the optimal $w_n$ but
also the basis vectors in H to maximize the correntropy
between $x_n$ and $H w_n$, $n = 1, \cdots, N$. The essential difference
between NMF-MCC and CESR lies in whether the
basis vectors are learned or not.

In order to solve the optimization problem, we rec- 
ognize that the expectation conditional maximization 
(ECM) method [19] can be applied. Based on the the- 
ory of convex conjugate functions [20], we can derive the 
following proposition that forms the basis to solve the 
optimization problem in (9): 

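Proposition 1. There exists a convex conjugate function $\varphi$
of $g(z, \sigma)$ such that

\[ g(z, \sigma) = \sup_{\varrho} \left( \varrho \frac{\|z\|^2}{2\sigma^2} - \varphi(\varrho) \right), \tag{10} \]

and for a fixed z, the supremum is reached at $\varrho = -g(z, \sigma)$.
Here $g(z, \sigma) = \exp\left(-\|z\|^2 / (2\sigma^2)\right)$ is the Gaussian kernel in (9)
written as a function of the error z.

By substituting Eq. (10) into (9), we have the augmented
objective function in an enlarged parameter
space

\[ \max_{H, W, \rho} \hat{F}(H, W, \rho) \quad \text{s.t. } H \ge 0,\ W \ge 0, \]

\[ \hat{F}(H, W, \rho) = \sum_{d=1}^{D} \left( \rho_d \sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk} w_{kn} \right)^2 - \varphi(\rho_d) \right), \tag{11} \]

where $\varphi$ is the convex conjugate function of g(z) defined
in Proposition 1, and $\rho = [\rho_1, \cdots, \rho_D]^\top$ are
the auxiliary variables. The positive constant factor $1/(2\sigma^2)$
from (10) is not written explicitly in (11); for fixed $\rho$ it does not
change the maximizer over H and W.

According to Proposition 1, for fixed H and W, the
following equation holds:

\[ F(H, W) = \max_{\rho} \hat{F}(H, W, \rho). \tag{12} \]

It follows that

\[ \max_{H, W} F(H, W) = \max_{H, W} \max_{\rho} \hat{F}(H, W, \rho) = \max_{H, W, \rho} \hat{F}(H, W, \rho). \tag{13} \]

That is, maximizing F(H, W) is equivalent to maximizing
the augmented function $\hat{F}(H, W, \rho)$.

The NMF-MCC Algorithm

The traditional NMF can be solved by the expectation-maximization
(EM) algorithm [21]. However, in the case
of MCC-based NMF, EM must be replaced by ECM
because there is more than one parameter. Figure 2 shows
the outline of ECM, which is described in more detail
below.

1. E-Step: Compute $\rho$ given the current estimations of
the meta-sample matrix H and the coding matrix W as

\[ \rho_d^t = -g\left( \sqrt{\sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk}^t w_{kn}^t \right)^2},\ \sigma^t \right), \tag{14} \]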
Figure 2 Outline of the ECM-based NMF-MCC algorithm (flowchart elements: gene expression data matrix X; gene weighting vector $-\rho$; meta-sample matrix H; coding matrix W; iteration counter t).



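where t means the t-th iteration. In this study, the
kernel size (bandwidth) $\sigma^t$ is computed from the current
reconstruction error, scaled by $\theta$,

\[ (\sigma^t)^2 \propto \theta \sum_{d=1}^{D} \sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk}^t w_{kn}^t \right)^2, \tag{15} \]

where $\theta$ is a parameter to control the sparseness of $\rho^t$.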
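2. CM-steps: In the CM-step, given $\rho^t$, we
optimize the following function with respect to H and W:

\[ (H^{t+1}, W^{t+1}) = \arg\max_{H, W} \sum_{d=1}^{D} \left( \rho_d^t \sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk} w_{kn} \right)^2 \right) = \arg\max_{H, W} \operatorname{Trace}\left[ (X - HW)^\top \operatorname{diag}(\rho^t) (X - HW) \right] \quad \text{s.t. } H \ge 0,\ W \ge 0, \tag{16} \]

where diag(·) is an operator that converts the vector
$\rho$ to a diagonal matrix.

By introducing a dual objective function,

\[ O(H, W) = \operatorname{Trace}\left[ (X - HW)^\top \operatorname{diag}(-\rho^t) (X - HW) \right] = \operatorname{Trace}\left[ X^\top \operatorname{diag}(-\rho^t) X \right] - 2 \operatorname{Trace}\left[ X^\top \operatorname{diag}(-\rho^t) HW \right] + \operatorname{Trace}\left[ W^\top H^\top \operatorname{diag}(-\rho^t) HW \right], \tag{17} \]

the optimization problem in (16) can be reformulated as
the following dual problem:

\[ (H^{t+1}, W^{t+1}) = \arg\min_{H, W} O(H, W) \quad \text{s.t. } H \ge 0,\ W \ge 0. \tag{18} \]

Let $\phi_{dk}$ and $\psi_{kn}$ be the Lagrange multipliers for the
constraints $h_{dk} \ge 0$ and $w_{kn} \ge 0$, respectively, and
$\Phi = [\phi_{dk}]$ and $\Psi = [\psi_{kn}]$. The Lagrangian $\mathcal{L}$ is

\[ \mathcal{L} = \operatorname{Trace}\left[ X^\top \operatorname{diag}(-\rho^t) X \right] - 2 \operatorname{Trace}\left[ X^\top \operatorname{diag}(-\rho^t) HW \right] + \operatorname{Trace}\left[ W^\top H^\top \operatorname{diag}(-\rho^t) HW \right] + \operatorname{Trace}\left[ \Phi H^\top \right] + \operatorname{Trace}\left[ \Psi W^\top \right]. \tag{19} \]

The partial derivatives of $\mathcal{L}$ with respect to H and W
are

\[ \frac{\partial \mathcal{L}}{\partial H} = -2 \operatorname{diag}(-\rho^t) X W^\top + 2 \operatorname{diag}(-\rho^t) H W W^\top + \Phi \tag{20} \]

and

\[ \frac{\partial \mathcal{L}}{\partial W} = -2 H^\top \operatorname{diag}(-\rho^t) X + 2 H^\top \operatorname{diag}(-\rho^t) H W + \Psi. \tag{21} \]

Using the Karush-Kuhn-Tucker optimality conditions,
i.e., $\phi_{dk} h_{dk} = 0$ and $\psi_{kn} w_{kn} = 0$, we get the following
equations for $h_{dk}$ and $w_{kn}$:

\[ -2 \left( \operatorname{diag}(-\rho^t) X W^\top \right)_{dk} h_{dk} + 2 \left( \operatorname{diag}(-\rho^t) H W W^\top \right)_{dk} h_{dk} = 0 \tag{22} \]

and

\[ -2 \left( H^\top \operatorname{diag}(-\rho^t) X \right)_{kn} w_{kn} + 2 \left( H^\top \operatorname{diag}(-\rho^t) H W \right)_{kn} w_{kn} = 0. \tag{23} \]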



These equations lead to the following updating rules
to maximize the expectation in (13).

• The meta-sample matrix H, conditioned on the
coding matrix W:

\[ h_{dk}^{t+1} \leftarrow h_{dk}^t \frac{\left( \operatorname{diag}(-\rho^t) X W^{t\top} \right)_{dk}}{\left( \operatorname{diag}(-\rho^t) H^t W^t W^{t\top} \right)_{dk}}. \tag{24} \]

• The coding matrix W, conditioned on the newly
estimated meta-sample matrix $H^{t+1}$:

\[ w_{kn}^{t+1} \leftarrow w_{kn}^t \frac{\left( H^{(t+1)\top} \operatorname{diag}(-\rho^t) X \right)_{kn}}{\left( H^{(t+1)\top} \operatorname{diag}(-\rho^t) H^{t+1} W^t \right)_{kn}}. \tag{25} \]

We should note that if we exchange the numerator
and denominator in (24) and (25), new update
formulas are obtained. The new update rules are dual
to (24) and (25), and our experimental results show
that the dual update rules achieve clustering
performances similar to those of (24) and (25).
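The multiplicative updates (24) and (25) can be sketched for a fixed weight vector as follows. The small constant `eps` in the denominators, the random toy data and the arbitrary negative weights are assumptions introduced here for numerical safety and illustration; they are not part of the derivation above.

```python
import numpy as np

def cm_step(X, H, W, rho, eps=1e-12):
    """One CM-step for fixed rho (all entries <= 0), following Eqs. (24) and (25)."""
    g = -rho[:, None]                      # diag(-rho) applied as per-gene row weights
    H_new = H * ((g * X) @ W.T) / ((g * (H @ W)) @ W.T + eps)                 # Eq. (24)
    W_new = W * (H_new.T @ (g * X)) / (H_new.T @ (g * (H_new @ W)) + eps)     # Eq. (25)
    return H_new, W_new

def weighted_error(X, H, W, rho):
    """Dual objective O(H, W) of Eq. (17): gene-weighted squared reconstruction error."""
    R = X - H @ W
    return float(np.sum(-rho[:, None] * R ** 2))

rng = np.random.default_rng(0)
D, N, K = 8, 6, 2
X = rng.random((D, N))
H = rng.random((D, K)) + 0.1
W = rng.random((K, N)) + 0.1
rho = -(rng.random(D) + 0.1)               # arbitrary fixed negative auxiliary variables

errors = [weighted_error(X, H, W, rho)]
for _ in range(30):
    H, W = cm_step(X, H, W, rho)
    errors.append(weighted_error(X, H, W, rho))

print(errors[0], errors[-1])
```

Because the updates only multiply non-negative quantities, H and W stay non-negative without any explicit projection, and the recorded values of O(H, W) give a quick numerical check of the monotonic behaviour.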

Algorithm 1 summarizes the optimization procedure. 

Algorithm 1 NMF-MCC Algorithm.
Require: Input gene expression data matrix X;
Require: Initial meta-sample matrix $H^1$ and coding
matrix $W^1$;
for $t = 1, \cdots, T$ do

Update the auxiliary variables $\rho^t$ as in (14);
Update the meta-sample matrix $H^{t+1}$ as in (24);
Update the coding matrix $W^{t+1}$ as in (25);
end for

Output $H = H^{T+1}$ and $W = W^{T+1}$.
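Algorithm 1 can be sketched end to end as below. This is an illustrative reading under stated assumptions, not the authors' implementation: g is taken to be the Gaussian kernel, the bandwidth rule `sigma2 = theta * sum(e2) / (2 * D)` is one assumed instance of (15), and the random initialization, `eps`, `theta`, `T` and the toy data are all hypothetical choices.

```python
import numpy as np

def nmf_mcc(X, K, T=200, theta=1.0, seed=0, eps=1e-12):
    """Sketch of Algorithm 1; returns H, W and the gene weights -rho."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, K)) + 0.1           # random non-negative initialization
    W = rng.random((K, N)) + 0.1
    for _ in range(T):
        # E-step, Eq. (14): rho_d = -g(e_d, sigma) with a Gaussian g (assumption)
        e2 = np.sum((X - H @ W) ** 2, axis=1)
        sigma2 = theta * e2.sum() / (2.0 * D) + eps   # assumed form of Eq. (15)
        rho = -np.exp(-e2 / (2.0 * sigma2))
        g = -rho[:, None]
        # CM-steps, Eqs. (24) and (25)
        H = H * ((g * X) @ W.T) / ((g * (H @ W)) @ W.T + eps)
        W = W * (H.T @ (g * X)) / (H.T @ (g * (H @ W)) + eps)
    return H, W, -rho

rng = np.random.default_rng(1)
D, N, K = 20, 15, 3
X = rng.random((D, K)) @ rng.random((K, N))   # genes that follow a rank-3 structure
X[0] = 10.0 * rng.random(N)                   # gene 0 is replaced by large noise

H, W, weights = nmf_mcc(X, K)
print("gene weights:", np.round(weights, 3))
```

In this toy setting the noisy gene cannot be explained by the shared meta-samples, so its weight is driven towards zero while the remaining genes keep weights close to one. This is the mechanism by which $-\rho$ acts as a gene weighting vector.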



Proof of convergence 

In this section, we will prove that the objective function in 
(16) is nonincreasing under the updating rules in (24) and 
(25). 

Theorem 1. The objective function in (16) is nonin- 
creasing under the update rules (24) and (25). 



To prove the above theorem, we first define an auxiliary 
function. 

Definition 1. $G(w, w')$ is an auxiliary function for $F(w)$
if the conditions

\[ G(w, w') \ge F(w), \qquad G(w, w) = F(w) \tag{26} \]

are satisfied.

The auxiliary function is quite useful because of the
following lemma:

Lemma 1. If G is an auxiliary function of F, then F is
nonincreasing under the update

\[ w^{t+1} = \arg\min_{w} G(w, w^t). \tag{27} \]



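We refer the readers to [22] for the proof of this
lemma. Now, we show that the updating rule of (25)
is exactly the update in (27) with a proper auxiliary
function. We denote the dual form (17) of the objective
function in (16) as O:

\[ O = \sum_{d=1}^{D} \left( -\rho_d^t \sum_{n=1}^{N} \left( x_{dn} - \sum_{k=1}^{K} h_{dk} w_{kn} \right)^2 \right) = \operatorname{Trace}\left[ (X - HW)^\top \operatorname{diag}(-\rho^t) (X - HW) \right]. \tag{28} \]

Considering any element, $w_{kn}$, in W, we use $F_{kn}$ to
denote the part of the objective function that is
relevant only to $w_{kn}$. It is easy to check that

\[ F'_{kn} = \left( \frac{\partial O}{\partial W} \right)_{kn} = \left( -2 H^\top \operatorname{diag}(-\rho^t) X + 2 H^\top \operatorname{diag}(-\rho^t) H W \right)_{kn}, \qquad F''_{kn} = 2 \left( H^\top \operatorname{diag}(-\rho^t) H \right)_{kk}. \tag{29} \]

Since the updating rule is essentially based on elements,
it is sufficient to show that each $F_{kn}$ is nonincreasing under
the update step of (25).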



Table 1 Summary of the six cancer gene expression datasets used to test the NMF-MCC algorithm

Dataset name   Diagnostic task                          Samples (N)   Genes (D)   Cancer classes (K)   Ref
Leukemia       Acute myelogenous leukemia               72            5327        3                    [25]
Brain Tumor    5 human brain tumor types                90            5920        5                    [26]
Lung Cancer    4 lung cancer types and normal tissues   203           12600       5                    [27]
9 Tumors       9 various human tumor types              60            5726        9                    [28]
SRBCT          Small, round blue cell tumors            83            2308        4                    [29]
DLBCL          Diffuse large B-cell lymphomas           77            5469        2                    [24]
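Lemma 2. Function

\[ G(w, w_{kn}^t) = F_{kn}(w_{kn}^t) + F'_{kn}(w_{kn}^t)(w - w_{kn}^t) + \frac{\left( H^\top \operatorname{diag}(-\rho^t) H W^t \right)_{kn}}{w_{kn}^t} (w - w_{kn}^t)^2 \tag{30} \]

is an auxiliary function for $F_{kn}$, which is relevant only
to $w_{kn}$.

Proof. Since $G(w, w) = F_{kn}(w)$ is obvious, we only need
to show that $G(w, w_{kn}^t) \ge F_{kn}(w)$. To do this, we compare
the Taylor series expansion of $F_{kn}(w)$,

\[ F_{kn}(w) = F_{kn}(w_{kn}^t) + F'_{kn}(w_{kn}^t)(w - w_{kn}^t) + \frac{1}{2} F''_{kn} (w - w_{kn}^t)^2 = F_{kn}(w_{kn}^t) + F'_{kn}(w_{kn}^t)(w - w_{kn}^t) + \left( H^\top \operatorname{diag}(-\rho^t) H \right)_{kk} (w - w_{kn}^t)^2, \tag{31} \]

with (30) to find that $G(w, w_{kn}^t) \ge F_{kn}(w)$ is equivalent to

\[ \frac{\left( H^\top \operatorname{diag}(-\rho^t) H W^t \right)_{kn}}{w_{kn}^t} \ge \left( H^\top \operatorname{diag}(-\rho^t) H \right)_{kk}, \quad \text{i.e.,} \quad \left( H^\top \operatorname{diag}(-\rho^t) H W^t \right)_{kn} \ge \left( H^\top \operatorname{diag}(-\rho^t) H \right)_{kk} w_{kn}^t. \tag{32} \]

We have

\[ \left( H^\top \operatorname{diag}(-\rho^t) H W^t \right)_{kn} = \sum_{l=1}^{K} \left( H^\top \operatorname{diag}(-\rho^t) H \right)_{kl} w_{ln}^t \ge \left( H^\top \operatorname{diag}(-\rho^t) H \right)_{kk} w_{kn}^t. \tag{33} \]

Thus, (32) holds and $G(w, w_{kn}^t) \ge F_{kn}(w)$.
□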



We can now demonstrate the convergence of 
Theorem 1. 



Figure 3 The boxplots of the clustering accuracies for NMF with different loss functions (L2 norm, KL, α-divergence, EMD and MCC) over 100 runs on the six gene expression
datasets: (a) Leukemia, (b) Brain Tumor, (c) Lung Cancer, (d) 9 Tumors, (e) SRBCT, (f) DLBCL.
Proof of Theorem 1. Replacing $G(w, w_{kn}^t)$ in (27) by (30)
results in the update rule

\[ w_{kn}^{t+1} = w_{kn}^t - w_{kn}^t \frac{F'_{kn}(w_{kn}^t)}{2 \left( H^\top \operatorname{diag}(-\rho^t) H W^t \right)_{kn}} = w_{kn}^t \frac{\left( H^\top \operatorname{diag}(-\rho^t) X \right)_{kn}}{\left( H^\top \operatorname{diag}(-\rho^t) H W^t \right)_{kn}}. \tag{34} \]

Since (30) is an auxiliary function, $F_{kn}$ is nonincreasing
under this update rule as in (25).

Similarly, we can also show that O is nonincreasing
under the updating steps in (24).

Experiments 

Datasets 

To test the proposed algorithm, we carry out exten- 
sive experiments on six cancer-related gene expression 
datasets. The six datasets consist of five multi-class 
sets as used in [4,23] and one binary class set [24]. 
The descriptions of the six datasets are summarized in 
Table 1. In these datasets, besides the gene expression 
data samples, the labels are also given. They were obtained 
from the diagnosis results and reported in different 
studies [23]. 



Performance metric 

The proposed NMF-MCC algorithm is used to represent
gene expression data for k-means clustering. The
clustering results are evaluated by comparing the obtained
label of each sample with the label provided by the dataset.
The clustering accuracy is used to measure the clustering
performance. Given a microarray dataset containing
N samples that belong to K classes, we assume that K is
given in all the algorithms tested here. For each sample,
$x_n$, let $c_n$ be the cluster label predicted by an algorithm and
$r_n$ be the cancer type label provided by the dataset. The
accuracy of the algorithm is defined as:

\[ \text{Accuracy} = \frac{\sum_{n=1}^{N} I(r_n, c_n)}{N}, \tag{35} \]

where I(A,B) returns 1 if A = B and 0 otherwise.
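The accuracy in (35) can be sketched as follows. Cluster indices produced by k-means are arbitrary, so this sketch assumes that each cluster is first matched to a distinct class by the one-to-one mapping that gives the most agreements; that matching step, the brute-force search over permutations (practical only for a small K) and the function name are assumptions that do not appear in the text.

```python
from itertools import permutations

def clustering_accuracy(r, c):
    """Accuracy of Eq. (35) after matching cluster ids to class labels.

    r: class labels provided by the dataset; c: predicted cluster ids.
    The best one-to-one cluster-to-class mapping is found by brute force.
    """
    classes = sorted(set(r))
    clusters = sorted(set(c))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    best_hits = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))            # candidate cluster -> class map
        hits = sum(1 for rn, cn in zip(r, c) if mapping[cn] == rn)   # sum of I(r_n, c_n)
        best_hits = max(best_hits, hits)
    return best_hits / len(r)

r = ["A", "A", "A", "B", "B", "C"]     # toy cancer type labels
c = [2, 2, 0, 0, 0, 1]                 # toy cluster ids with arbitrary numbering
acc = clustering_accuracy(r, c)
print(acc)
```

In the toy example five of the six samples agree with their class under the best mapping, so one misassigned sample lowers the accuracy by 1/N.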



Tested methods 

We first compared the MCC with other loss functions
between X and HW for the NMF algorithm on the
cancer clustering problem, including the $\ell_2$ norm distance,
the KL distance [10], the α-divergence [4], and the earth mover's
distance (EMD) [12]. We further compared the proposed
NMF-MCC algorithm with other NMF-based algorithms,
including the penalized matrix decomposition
(PMD) algorithm [7], the original NMF algorithm [22], the
sparse non-negative matrix factorization (SNMF) algorithm
[3], the non-smooth non-negative matrix factorization
(nsNMF) algorithm [9] and the projected gradient
kernel nonnegative matrix factorization (PGK-NMF) algorithm [14].

Figure 4 The boxplots of the clustering accuracies for different versions of NMF algorithms over 100 runs on the six gene expression
datasets: (a) Leukemia, (b) Brain Tumor, (c) Lung Cancer, (d) 9 Tumors, (e) SRBCT, (f) DLBCL.

Figure 5 The gene weight vector learned by NMF-MCC with $-\rho$
on the SRBCT dataset.

Results 

Since the initial H and W are selected randomly, we performed
100 independent trials and computed the average
and the standard deviation of the accuracy for each loss
function. The results of the comparison of MCC with
other loss functions are presented in Figure 3. MCC
consistently performed the best on
all six datasets. The other loss functions performed
well on some datasets, but poorly on the others. It seems
that the improvement of MCC increased when the number
of genes increased. The standard deviation of the
accuracy of MCC was much smaller than that
of the other loss functions, indicating that MCC
is the most stable. On the other hand, although EMD
worked quite well in computer vision tasks [12], it did
not perform well on gene expression data, due to the significant
difference between image data and gene
expression data.

The results of the comparison of NMF-MCC with
other related NMF methods on the six datasets are
presented in Figure 4. The NMF-MCC algorithm outperformed
the other algorithms on five out of the six datasets.
The NMF-MCC algorithm could correctly cluster more
than 88% and 78% of the samples in the Leukemia and
DLBCL datasets, respectively, in a completely unsupervised
manner. In contrast, the $\ell_2$ norm distance-based
NMF algorithm performed even worse than the baseline
PMD algorithm on the Leukemia and DLBCL datasets,
with average accuracies of 73% and 67%, respectively.
This indicates that correntropy is a much better measure
for cancer clustering data. Note that NMF-MCC significantly
outperformed the other algorithms on the Lung
Cancer dataset, which contains a large number of genes.
This implies that among the large number of genes,
only a small fraction is likely to be relevant to cancerous
tumor growth or spread. In NMF-MCC, the auxiliary
variables $-\rho$ act as feature selectors, so the algorithm was
able to select the relevant genes. Although the SNMF
and nsNMF algorithms also improved on the performance
of the baseline NMF algorithm, the improvement
was much less than that of the NMF-MCC algorithm. A
possible reason is that many genes exhibit similar patterns
across all of the samples, with only a few genes
differentiating different cancer classes. They are likely
to be sampled from a nonlinear manifold. Hence, the
loss function defined by a linear kernel with either the
$\ell_2$ norm or the KL distance could not capture them. In
contrast, the NMF-MCC algorithm had a loss function
that was defined by the correntropy and a Gaussian kernel,
which could capture the nonlinear manifold structure
much more effectively. By mapping the gene expression
data into a nonlinear data space with a Gaussian kernel,
PGK-NMF outperformed the original NMF. However,
our NMF-MCC could improve further on
PGK-NMF by applying different kernels to different features.

To understand which genes were selected by the NMF-MCC
algorithm, we plotted the gene weights on the
SRBCT dataset (Figure 5). It can be seen that the $-\rho$
vector is sparse, which shows the significance of certain
genes. The resulting meta-sample matrix weighted
by $-\rho$ with the corresponding coding matrix is shown
in Figure 6. By comparing it to the coding matrix learned
by the original NMF with the $\ell_2$ norm distance in
Figure 1, we find that the coding matrix learned by
the NMF-MCC algorithm is much more discriminative
among different cancer classes. On this dataset, the NMF-MCC
algorithm achieved an average clustering accuracy
of 63%.

Discussion 

Traditional unsupervised learning techniques select features
with feature selection algorithms and then perform
clustering using the selected features. The NMF-MCC
algorithm proposed here achieves both goals simultaneously.
The learned gene weight vector reflects the
importance of the genes in the clustering task, and
the coding matrix encodes the clustering results for the
samples.

Our experimental results demonstrate that the improvement
of NMF-MCC over the other methods increases
when the number of genes increases. This shows the
ability of the proposed algorithm to effectively select
the important genes and cluster samples. This is an
important property because high-dimensional data analysis
has become increasingly frequent and important in
diverse fields of science, engineering and the social sciences,
ranging from genomics and health sciences to
economics, finance and machine learning. For instance,
in genome-wide association studies, hundreds of thousands
of SNPs are potential covariates for phenotypes
such as cholesterol level or height. The large number
of features presents an intrinsic challenge to many classical
problems, where the usual low-dimensional methods
no longer apply. The NMF-MCC algorithm has been
demonstrated to work well on datasets with small
numbers of samples but large numbers of features. It
can therefore provide a powerful tool to study high-dimensional
problems, such as genome-wide association
studies.



Conclusion 

We have proposed a novel NMF-MCC algorithm for
gene expression data-based cancer clustering. Experiments
demonstrate that correntropy is a better measure
than the traditional $\ell_2$ norm and KL distances for this task,
and the proposed algorithm significantly outperforms the
existing methods.